01 Machine Learning Fundamentals

Author

Shahrokh Vahedi

Published

June 5, 2023

1 Challenge Summary

Your organization wants to know which companies are similar to each other to help in identifying potential customers of a SAAS software solution (e.g. Salesforce CRM or equivalent) in various segments of the market. The Sales Department is very interested in this analysis, which will help them more easily penetrate various market segments.

We will be using stock prices in this analysis. We come up with a method to classify companies based on how their stocks trade using their daily stock returns (percentage movement from one day to the next). This analysis will help our organization determine which companies are related to each other (competitors and have similar attributes).

We can analyze the stock prices using what we’ve learned in the unsupervised learning tools including K-Means and UMAP. We will use a combination of kmeans() to find groups and umap() to visualize similarity of daily stock returns.

2 Objectives

Apply our knowledge on K-Means and UMAP along with dplyr, ggplot2, and purrr to create a visualization that identifies subgroups in the S&P 500 Index. We will specifically apply:

  • Modeling: kmeans() and umap()
  • Iteration: purrr
  • Data Manipulation: dplyr, tidyr, and tibble
  • Visualization: ggplot2 (bonus plotly)

3 Libraries

Load the following libraries.

library(tidyverse)
library(tidyquant)
library(broom)
library(umap)
library(plotly)
library(dplyr)
library(magrittr) # needs to be run every time you start R and want to use %>%
library(lubridate)

4 Data

We will be using stock prices in this analysis. We can read in the stock prices. The data is 1.2M observations. The most important columns for our analysis are:

  • symbol: The stock ticker symbol that corresponds to a company’s stock price
  • date: The timestamp relating the symbol to the share price at that point in time
  • adjusted: The stock price, adjusted for any splits and dividends (we use this when analyzing stock data over long periods of time)
sp_500_index_tbl <- readRDS("data01/sp_500_index_tbl.rds")
sp_500_index_tbl
sp_500_prices_tbl <- readRDS("data01/sp_500_prices_tbl.rds")
sp_500_prices_tbl

5 Question

Which stock prices behave similarly?

Answering this question helps us understand which companies are related, and we can use clustering to help us answer it!

Even if we’re not interested in finance, this is still a great analysis because it will tell us which companies are competitors and which are likely in the same space (often called sectors) and can be categorized together. Bottom line - This analysis can help us better understand the dynamics of the market and competition, which is useful for all types of analyses from finance to sales to marketing.

Let’s get started.

5.1 Step 1 - Convert stock prices to a standardized format (daily returns)

What we first need to do is get the data in a format that can be converted to a “user-item” style matrix. The challenge here is to connect the dots between what we have and what we need to do to format it properly.

We know that in order to compare the data, it needs to be standardized or normalized. Why? Because we cannot compare values (stock prices) that are of completely different magnitudes. In order to standardize, we will convert from adjusted stock price (dollar value) to daily returns (percent change from previous day). Here is the formula.

\[ return_{daily} = \frac{price_{i}-price_{i-1}}{price_{i-1}} \] First, what do we have? We have stock prices for every stock in the SP 500 Index, which is the daily stock prices for over 500 stocks. The data set is over 1.2M observations.

sp_500_prices_tbl %>% glimpse()
#> Rows: 1,225,765
#> Columns: 8
#> $ symbol   <chr> "MSFT", "MSFT", "MSFT", "MSFT", "MSFT", "MSFT", "MSFT", "MSFT~
#> $ date     <date> 2009-01-02, 2009-01-05, 2009-01-06, 2009-01-07, 2009-01-08, ~
#> $ open     <dbl> 19.53, 20.20, 20.75, 20.19, 19.63, 20.17, 19.71, 19.52, 19.53~
#> $ high     <dbl> 20.40, 20.67, 21.00, 20.29, 20.19, 20.30, 19.79, 19.99, 19.68~
#> $ low      <dbl> 19.37, 20.06, 20.61, 19.48, 19.55, 19.41, 19.30, 19.52, 19.01~
#> $ close    <dbl> 20.33, 20.52, 20.76, 19.51, 20.12, 19.52, 19.47, 19.82, 19.09~
#> $ volume   <dbl> 50084000, 61475200, 58083400, 72709900, 70255400, 49815300, 5~
#> $ adjusted <dbl> 15.86624, 16.01451, 16.20183, 15.22628, 15.70234, 15.23408, 1~
sp_500_daily_returns_tbl <- sp_500_prices_tbl %>%
  
  select(symbol, date, adjusted) %>%
  
  filter(date >= ymd("2018-01-01")) %>%
  
  group_by(symbol) %>%
  mutate(lag_1 = lag(adjusted)) %>%
  ungroup() %>%
  
  filter(!is.na(lag_1)) %>%
  
  mutate(diff = adjusted - lag_1) %>%
  mutate(pct_return = diff / lag_1) %>%
  
  select(symbol, date, pct_return)

# Apply your data transformation skills
sp_500_daily_returns_tbl
saveRDS(sp_500_daily_returns_tbl, file = "data01/sp_500_daily_returns_tbl.RDS") 

5.2 Step 2 - Convert to User-Item Format

stock_date_matrix_tbl <- readRDS("data01/sp_500_daily_returns_tbl.rds") %>%
  spread(key = date, value = pct_return, fill = 0)

stock_date_matrix_tbl

5.3 Step 3 - Perform K-Means Clustering

kmeans_obj <- stock_date_matrix_tbl %>%
  select(-symbol) %>%
  kmeans(centers = 4, nstart = 20)

kmeans_obj %>% glance()

5.4 Step 4 - Find the optimal value of K

# Lets use `purrr` to iterate over many values of "k" using the `centers` argument. 
kmeans_mapper <- function(center = 3) {
  stock_date_matrix_tbl %>%
    select(-symbol) %>%
    kmeans(centers = center, nstart = 20)
}

# Apply the `kmeans_mapper()` and `glance()` functions iteratively using `purrr`.

k_means_mapped_tbl <- tibble(centers = 1:30) %>%
  mutate(k_means = centers %>% map(kmeans_mapper)) %>%
  mutate(glance  = k_means %>% map(glance))

#Output: k_means_mapped_tbl
k_means_mapped_tbl

Scree Plot

Next, let’s visualize the “tot.withinss” from the glance output as a Scree Plot.

k_means_mapped_tbl %>%
  unnest(glance) %>%
  ggplot(aes(centers, tot.withinss)) +
  geom_point(color = "#3e502c") +
  geom_line(color = "#502c50") +
  labs(title = "Scree Plot") +
  theme_tq()

We can see that the Scree Plot becomes linear (constant rate of change) between 5 and 10 centers for K.

5.5 Step 5 - Apply UMAP

Next, let’s plot the UMAP 2D visualization to help us investigate cluster assignments.

# let's plot the `UMAP` 2D visualization to help us investigate cluster assignments. 
umap_results <- stock_date_matrix_tbl %>%
  select(-symbol) %>%
  umap()

# Next, we want to combine the `layout` from the `umap_results` with the `symbol` column from the `stock_date_matrix_tbl`. 
umap_results_tbl <- umap_results$layout %>%
  as_tibble() %>%
  bind_cols(stock_date_matrix_tbl %>% select(symbol)) 

umap_results_tbl
# Finally, let's make a quick visualization of the `umap_results_tbl`.

umap_results_tbl %>%
  ggplot(aes(V1, V2)) +
  geom_point(alpha = 0.5, color = "#2c3e50") +
  theme_tq() +
  labs(title = "UMAP Projection")

We can now see that we have some clusters. However, we still need to combine the K-Means clusters and the UMAP 2D representation.

5.6 Step 6 - Combine K-Means and UMAP

# Next, we combine the K-Means clusters and the UMAP 2D representation
# First, pull out the K-Means for 10 Centers. Use this since beyond this value the Scree Plot flattens. 

k_means_mapped_tbl <- read_rds("data01/k_means_mapped_tbl.rds")
umap_results_tbl   <- read_rds("data01/umap_results_tbl.rds")

k_means_obj <- k_means_mapped_tbl %>%
  filter(centers == 10) %>%
  pull(k_means) %>%
  pluck(1)

# Next, we'll combine the clusters from the `k_means_obj` with the `umap_results_tbl`.

umap_kmeans_results_tbl <- k_means_obj %>% 
  augment(stock_date_matrix_tbl) %>%
  select(symbol, .cluster) %>%
  left_join(umap_results_tbl, by = "symbol") %>%
  left_join(sp_500_index_tbl %>% select(symbol, company, sector),
            by = "symbol")

# Plot the K-Means and UMAP results.

umap_kmeans_results_tbl %>%
  ggplot(aes(V1, V2, color = .cluster)) +
  geom_point(alpha = 0.5) +
  theme_tq() +
  scale_color_tq()

Congratulations! We are done with the 1st challenge!